Comparing concat(), merge(), and join() in Pandas

A comprehensive guide to understanding the differences between these three essential Pandas functions with detailed examples and explanations.

Tables Used in the Examples

The following tables will be used in all examples in this guide. They contain realistic and useful data to demonstrate the differences between concat(), merge(), and join().

Table 1: Employees
Employee ID Name Department
101 Fatimah Al-Zahrani Finance
102 Ahmed Al-Zahrani Human Resources
103 Mohammed Al-Zahrani IT
104 Sheikha Al-Zahrani Marketing
Table 2: Salaries
Employee ID Salary Bonus
101 5000 500
102 6000 400
103 7000 600
105 8000 700

The following examples will use these two tables to clarify the differences between concat(), merge(), and join(). Understanding these tables is essential to follow the examples.

1. The concat() Function

Purpose: The concat() function is used to concatenate two or more DataFrames along a specific axis, either rows or columns. It is commonly used for combining datasets when you want to stack or add new rows/columns.

Key Features:

Limitations:

  • Requires matching data structure.
  • Does not support merging on columns.

Example Code:

 
import pandas as pd

# DataFrames based on the example tables
df_employees = pd.DataFrame({
    'Employee ID': [101, 102, 103, 104],
    'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
    'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})

df_salaries = pd.DataFrame({
    'Employee ID': [101, 102, 103, 105],
    'Salary': [5000, 6000, 7000, 8000],
    'Bonus': [500, 400, 600, 700]
})

# Concatenate along rows (axis=0)
result = pd.concat([df_employees, df_salaries], ignore_index=True)

print(result)

      

Output :

    Employee ID                Name          Department  Salary  Bonus
0          101  Fatimah Al-Zahrani             Finance     NaN    NaN
1          102    Ahmed Al-Zahrani     Human Resources     NaN    NaN
2          103  Mohammed Al-Zahrani                  IT     NaN    NaN
3          104   Sheikha Al-Zahrani          Marketing     NaN    NaN
4          101                  NaN                NaN   5000    500
5          102                  NaN                NaN   6000    400
6          103                  NaN                NaN   7000    600
7          105                  NaN                NaN   8000    700
      

Explanation: In this example, the concat() function is used to combine the two tables, Employees and Salaries. Since they have different columns, NaN values are inserted where data is missing. This demonstrates how concatenation works by stacking the rows of one DataFrame below the other.

2. The merge() Function

Key Features:

Limitations:

  • Slower compared to concat() in some cases.

Example Code:

 
import pandas as pd
# DataFrames based on the example tables
df_employees = pd.DataFrame({
    'Employee ID': [101, 102, 103, 104],
    'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
    'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})

df_salaries = pd.DataFrame({
    'Employee ID': [101, 102, 103, 105],
    'Salary': [5000, 6000, 7000, 8000],
    'Bonus': [500, 400, 600, 700]
})

# Merge the DataFrames on 'Employee ID' column (inner join by default)
result_inner = pd.merge(df_employees, df_salaries, on='Employee ID')

# Merge the DataFrames with a left join
result_left = pd.merge(df_employees, df_salaries, on='Employee ID', how='left')

print("Inner Join Result:")
print(result_inner)

print("\nLeft Join Result:")
print(result_left)

      

Output :(Inner Join)

   Employee ID                Name          Department  Salary  Bonus
0          101  Fatimah Al-Zahrani             Finance   5000    500
1          102    Ahmed Al-Zahrani     Human Resources   6000    400
2          103  Mohammed Al-Zahrani                  IT   7000    600
      

Output :(Left Join)

   Employee ID                Name          Department  Salary  Bonus
0          101  Fatimah Al-Zahrani             Finance   5000  500.0
1          102    Ahmed Al-Zahrani     Human Resources   6000  400.0
2          103  Mohammed Al-Zahrani                  IT   7000  600.0
3          104   Sheikha Al-Zahrani          Marketing    NaN    NaN
      

Explanation:

  • In the inner join, only rows with matching Employee ID values in both DataFrames are included. Thus, the row for Employee ID 105 from df_salaries and Employee ID 104 from df_employees are excluded.
  • In the left join, all rows from the left DataFrame (df_employees) are retained, even if there's no match in the right DataFrame (df_salaries). Missing values in the right DataFrame are filled with NaN.
  • This example illustrates how merging is used to combine datasets with related but not identical information.

3. The join() Function

Purpose: The join() function is used to combine two DataFrames based on their indices or a specified key column. It is particularly useful for index-based merges and simplifies the process when the indices of both DataFrames are aligned or need alignment.

Key Features:

Limitations:

  • Primarily works on the index and may require customization for column-based merging.

Example Code:

 
import pandas as pd

# Create two DataFrames: one with employee details and another with salary information
df_employees = pd.DataFrame({
    'Employee ID': [101, 102, 103, 104],
    'Name': ['Fatimah Al-Zahrani', 'Ahmed Al-Zahrani', 'Mohammed Al-Zahrani', 'Sheikha Al-Zahrani'],
    'Department': ['Finance', 'Human Resources', 'IT', 'Marketing']
})

df_salaries = pd.DataFrame({
    'Employee ID': [101, 102, 103, 105],
    'Salary': [5000, 6000, 7000, 8000],
    'Bonus': [500, 400, 600, 700]
})

# Join the DataFrames on 'Employee ID'
result = df_employees.set_index('Employee ID').join(df_salaries.set_index('Employee ID'))

print(result)

      

Output :

                    Name            Department  Salary  Bonus
Employee ID                                           
101      Fatimah Al-Zahrani       Finance    5000    500
102      Ahmed Al-Zahrani   Human Resources    6000    400
103  Mohammed Al-Zahrani           IT    7000    600
104      Sheikha Al-Zahrani   Marketing    NaN     NaN
      

Explanation:

  • The join() function is used here to combine the df_employees and df_salaries DataFrames using their common column Employee ID as the index. Both DataFrames are first set to have Employee ID as their index using set_index().
  • The rows from the df_employees DataFrame are retained, while the corresponding salary and bonus data from the df_salaries DataFrame is added based on matching Employee ID.
  • For Employee ID 104, there is no corresponding entry in df_salaries, so the resulting columns for Salary and Bonus are NaN for that row.

Summary Comparison Between concat(), merge(), and join()

Feature concat() merge() join()
Purpose Merges DataFrames along a specified axis (row-wise or column-wise). Performs SQL-like joins based on common columns or indices. Joins DataFrames using their index.
Primary Use Case Stacking or appending DataFrames. Combining datasets with key-based relationships. Efficiently joining on index without manual key matching.
Default Behavior Concatenates along rows (axis=0). Performs an inner join by default. Performs a left join by default.
Join Type Supported Not applicable (direct concatenation). inner, outer, left, right. inner, outer, left, right.
Index Handling Keeps original index unless ignore_index=True is used. Can merge on index or column with on, left_on, right_on. Uses index by default but can specify a column with on=.
Duplicates Handling Retains all rows unless manually removed. Keeps duplicate rows unless handled separately. Duplicates remain unless manually handled.
NaN Handling Preserves NaN values unless fillna() is used. Introduces NaNs for non-matching keys in outer joins. Introduces NaNs for missing index matches unless how='inner'.
MultiIndex Support Yes, supports MultiIndex concatenation. Yes, supports MultiIndex merging. Yes, primarily for index-based operations.
Performance Faster for simple stacking operations. Slower due to key-based lookups. Faster than merge() for index-based joins.
Example Usage pd.concat([df1, df2], axis=0) df1.merge(df2, on='key', how='inner') df1.join(df2, how='left')

Test Your Knowledge

Question 1: Which function would you use to combine two DataFrames based on a common column?

Question 2: Which function is best for combining DataFrames along a particular axis, without considering index or column labels?

Question 3: Which function is more suited for combining DataFrames based on the index?

Question 4: Which function allows you to specify the type of join (inner, outer, etc.)?

Question 5: Which function would you use to combine DataFrames with different columns but the same index?